In [1]:
# Run before lecture to load datasets and do simple prep
library(tidyverse) #all our data wrangling/plotting
options(repr.matrix.max.rows = 6)
# Making scatter points a bit bigger so that students can see them
update_geom_defaults("point", list(size = 3))

#Mauna Loa
co2_df <- tibble(
    concentration = as.vector(co2),
    date = lubridate::date_decimal(as.numeric(time(co2)))
)

#Top 12 Island landmasses
islands_df <- enframe(islands)
colnames(islands_df) <- c('landmass', 'size')
islands_df <- top_n(islands_df, 12, size)

continents <- c('Africa', 'Antarctica', 'Asia', 'Australia', 'Europe', 'North America', 'South America')
islands_df <- mutate(islands_df, is_continent = ifelse(landmass %in% continents, 'Continent', 'Other'))

gapminder <- read_csv("data/gapminder.csv")
gapminder_2016 <- gapminder |>
    select(country, year, continent, life_expectancy) |>
    filter(year == 2016)

#old faithful, mtcars -- nothing to do
Rows: 10545 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, continent, region
dbl (6): year, infant_mortality, life_expectancy, fertility, population, gdp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

DSCI 100 - Introduction to Data Science¶

Lecture 4 - Data visualization in R¶

Attribution: images in these slides that are not accompanied by code mostly come from
The Fundamentals of Data Visualization by Claus O. Wilke

Artwork by @allison_horst

Today: Visualization¶

image source: R for Data Science by Grolemund & Wickham

Designing a visualization: ask a question, then answer it¶

The purpose of a visualization is to answer a question about a dataset of interest.

A good visualization answers the question clearly. A great visualization also hints at the question itself.

On their own, visualizations help us answer the first two of these six types of questions:

  • descriptive: What are the largest 7 landmasses on Earth?
  • exploratory: Is there a relationship between penguin body mass and bill length?
  • inferential
  • predictive
  • causal
  • mechanistic

(we need more tools + visualizations to answer the others)

  • Descriptive: asks about summarized characteristics of a data set without interpretation, i.e., reports a fact (describe characteristics).

  • Exploratory: asks whether there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study (discovery of ideas and thoughts).

  • Inferential: asks whether an association observed in your exploratory analysis holds in a different sample that is representative of the population (infer what is true).

  • Predictive: asks what predicts an outcome, e.g. what predicts whether someone will eat a certain diet.

  • Causal: asks whether changing one factor will change another factor.

  • Mechanistic: asks how a change in one factor leads to a change in another, e.g. how diet leads to a reduction in the number of viral illnesses.

Creating visualizations in R¶

  • It's an iterative procedure. Try things, make mistakes, and refine!

  • We will use ggplot2. There are three key aspects of plots in ggplot2:

    1. aesthetic mappings: map dataframe columns to visual properties
    2. geometric objects: encode how to display those visual properties
    3. scales: transform variables, set limits
  • Add these one by one using +
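
A minimal sketch of how the three aspects stack up with +, using the built-in mtcars data. The log scale and the axis limits are illustrative choices made for this example, not part of the plots built later in the lecture.

```r
library(ggplot2)

# 1. aesthetic mappings: hp -> x position, mpg -> y position
base_plot <- ggplot(mtcars, aes(x = hp, y = mpg))

# 2. geometric object: draw each row of the data frame as a point
with_points <- base_plot + geom_point()

# 3. scale: log-transform the x axis and set its limits (illustrative values)
with_scale <- with_points + scale_x_log10(limits = c(50, 400))

with_scale
```

Each intermediate object is itself a plot, so you can print it at any step to see what the latest layer changed.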

Types of variables¶

A variable refers to a characteristic of interest and can be:

  1. categorical: can be divided into groups (categories) e.g. marital status
  2. quantitative: measured on a numeric scale (usually units are attached) e.g. height
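
A small sketch of how the two variable types can show up in R, using the built-in mtcars data. Treating cyl as categorical is a judgment call made for this example: it is stored as a number but takes only three distinct values. The name cyl_cat is made up here.

```r
# mpg is quantitative: measured on a numeric scale (miles per gallon)
is.numeric(mtcars$mpg)

# cyl is stored as a number too, but it only takes a few distinct values
is.numeric(mtcars$cyl)
unique(mtcars$cyl)

# Converting it to a factor tells R to treat it as categorical
# (cyl_cat is a hypothetical name used only in this sketch)
cyl_cat <- factor(mtcars$cyl)
levels(cyl_cat)
```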

Scatter Plots¶

To visualize the relationship between two quantitative variables

e.g. Is there a relationship between horsepower and fuel economy of an engine? Does the number of cylinders affect that relationship?

In [2]:
# Load libraries for wrangling and plotting
library(tidyverse)
In [3]:
# Inspect the data
mtcars
A data.frame: 32 × 11
                 mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
               <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
Mazda RX4       21.0      6    160    110   3.90  2.620  16.46      0      1      4      4
Mazda RX4 Wag   21.0      6    160    110   3.90  2.875  17.02      0      1      4      4
Datsun 710      22.8      4    108     93   3.85  2.320  18.61      1      1      4      1
⋮
Ferrari Dino    19.7      6    145    175   3.62   2.77   15.5      0      1      5      6
Maserati Bora   15.0      8    301    335   3.54   3.57   14.6      0      1      5      8
Volvo 142E      21.4      4    121    109   4.11   2.78   18.6      1      1      4      2
In [4]:
# Set the default size for all plots
options(repr.plot.width = 10, repr.plot.height = 8)

# Is there a relationship between fuel economy and horsepower?

Build up one-by-one:

# Base 
ggplot(mtcars, aes(x = hp, y = mpg)) +
    geom_point()
    
# It's difficult to see the text, so make it bigger by adding a theme layer
ggplot(mtcars, aes(x = hp, y = mpg)) +
    geom_point() +
    theme(text = element_text(size = 26))

As horsepower increases, miles per gallon (fuel efficiency) tends to decrease (a negative relationship). But is this true for all cars? Can we group the data in some way to find out more? What about grouping by the number of cylinders (the size of the engine)?

# Color per cylinder
ggplot(mtcars, aes(x = hp, y = mpg, color = cyl)) +
    geom_point()

# Change color to factor/categorical
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
    geom_point()

Cars with more cylinders tend to have higher horsepower and lower fuel efficiency. We can make this plot easier to understand by adding axis labels:

# Add labels (append this layer to the plot with +)
labs(x='Horsepower', y='Miles per Gallon', color='Cylinders')

# Full final plot
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
    geom_point() + 
    theme(text = element_text(size = 26)) +
    labs(x='Horsepower', y='Miles per Gallon', color='Cylinders')
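
One possible variation, shown only as a sketch: convert cyl to a factor once with mutate() so that it does not have to be wrapped in factor() inside every aes() call. The names mtcars_cat and cyl_plot are made up for this example.

```r
library(dplyr)
library(ggplot2)

# Store cyl as a categorical variable up front
# (mtcars_cat is a hypothetical name used only in this sketch)
mtcars_cat <- mutate(mtcars, cyl = factor(cyl))

# Same plot as above, but aes() can now refer to cyl directly
cyl_plot <- ggplot(mtcars_cat, aes(x = hp, y = mpg, color = cyl)) +
    geom_point() +
    theme(text = element_text(size = 26)) +
    labs(x = "Horsepower", y = "Miles per Gallon", color = "Cylinders")

cyl_plot
```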

Line Plots¶

To visualize trends with respect to an independent quantity

e.g. How did atmospheric carbon dioxide change between 1959 and 1997?

Mauna Loa Research Station

In [5]:
# Inspect the data
co2_df
A tibble: 468 × 2
concentration  date
        <dbl>  <dttm>
       315.42  1959-01-01 00:00:00
       316.31  1959-01-31 10:00:00
       316.50  1959-03-02 20:00:00
⋮
       360.83  1997-10-01 18:00:00
       362.49  1997-11-01 04:00:00
       364.34  1997-12-01 14:00:00

dttm = Datetime column type

In [6]:
# Change the default text size for all plots
theme_set(theme_gray(base_size = 26))

# How does atmospheric CO2 concentration change over time?

Start with geom_point as we just learned:

ggplot(co2_df, aes(x=date, y=concentration)) +
    geom_point()

The visualization shows a clear upward trend in the atmospheric concentration of CO2 over time. However, something is not quite right here; ask students what is wrong with this plot (overplotting).

Switch to geom_line:

ggplot(co2_df, aes(x=date, y=concentration)) +
    geom_line()

Share additional conclusion: The concentration seems to oscillate as well.

Optionally add labels or just mention it:

    labs(x = 'Date', y = 'CO2 Concentration')
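
For reference, a self-contained sketch of the labelled line plot. It rebuilds co2_df the same way as the setup cell; the "(ppm)" unit in the y label is an addition taken from the help page of the built-in co2 dataset, and co2_plot is a name made up for this example.

```r
library(tibble)
library(ggplot2)

# Rebuild the data frame from the built-in co2 time series (as in the setup cell)
co2_df <- tibble(
    concentration = as.vector(co2),
    date = lubridate::date_decimal(as.numeric(time(co2)))
)

# Line plot with human-readable axis labels
# (the ppm unit is an assumption based on the co2 dataset documentation)
co2_plot <- ggplot(co2_df, aes(x = date, y = concentration)) +
    geom_line() +
    labs(x = "Date", y = "CO2 Concentration (ppm)")

co2_plot
```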

Bar Plots¶

To visualize the comparison of amounts

e.g. Which are the largest 12 island landmasses on Earth? Are they all continents or are there some other islands with large landmasses as well?

Source: Wiktionary
In [7]:
# Inspect the data
islands_df
A tibble: 12 × 3
landmass          size  is_continent
<chr>            <dbl>  <chr>
Africa           11506  Continent
Antarctica        5500  Continent
Asia             16988  Continent
⋮
New Guinea         306  Other
North America     9390  Continent
South America     6795  Continent
In [8]:
# What are the largest 12 island landmasses on Earth?

Simplest approach first:

ggplot(islands_df, aes(x = landmass, y = size)) +
    geom_bar()

Challenging question: thinking back to the pre-reading, why do we get an error message? Because by default geom_bar counts the rows in each category and does not accept a y aesthetic:

ggplot(islands_df, aes(x = is_continent)) +
    geom_bar()
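
To see the numbers behind that default, here is a rough sketch that computes the same counts by hand with count(). It rebuilds islands_df as in the setup cell; continent_counts is a name made up for this example.

```r
library(tidyverse)

continents <- c('Africa', 'Antarctica', 'Asia', 'Australia',
                'Europe', 'North America', 'South America')

# Rebuild the top-12 landmass table from the built-in islands vector
islands_df <- enframe(islands, name = "landmass", value = "size") |>
    top_n(12, size) |>
    mutate(is_continent = ifelse(landmass %in% continents, 'Continent', 'Other'))

# One row per category with the number of landmasses in it;
# these are the bar heights that the default geom_bar() draws
continent_counts <- count(islands_df, is_continent)
continent_counts
```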

To make geom_bar use the y aesthetic, we need to change the stat option to identity (use the value as is, don't count anything).

ggplot(islands_df, aes(x = landmass, y = size)) +
    geom_bar(stat = 'identity')

The x-axis labels are overlapping. It is possible to rotate them, but that would make them hard to read. A more effective approach is to swap the axes so that the labels are on the y-axis.

ggplot(islands_df, aes(x = size, y = landmass)) +
    geom_bar(stat = 'identity')

To make the plot easier to read, we can reorder the bars by size. Generally we put the largest bar closest to the axis, but this is not a hard rule.

ggplot(islands_df, aes(x = size, y = reorder(landmass, -size))) +
    geom_bar(stat = 'identity')

Mention that we should change the labels to meaningful names and optionally show it. Optionally color bars by continent as the last step.
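
A sketch of what those last optional steps could look like. It rebuilds islands_df as in the setup cell; the axis unit comes from the help page of the built-in islands dataset (areas in thousands of square miles), while the legend title "Type" and the name islands_plot are choices made up for this example.

```r
library(tidyverse)

continents <- c('Africa', 'Antarctica', 'Asia', 'Australia',
                'Europe', 'North America', 'South America')

# Rebuild the top-12 landmass table from the built-in islands vector
islands_df <- enframe(islands, name = "landmass", value = "size") |>
    top_n(12, size) |>
    mutate(is_continent = ifelse(landmass %in% continents, 'Continent', 'Other'))

# Horizontal bars ordered by size, filled by continent status, with readable labels
islands_plot <- ggplot(islands_df,
                       aes(x = size, y = reorder(landmass, -size), fill = is_continent)) +
    geom_bar(stat = 'identity') +
    labs(x = "Size (thousands of square miles)", y = "Landmass", fill = "Type")

islands_plot
```

Note that fill, not color, is the aesthetic that colours the inside of each bar.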

Histograms¶

To visualize the distribution of a single quantitative variable

e.g. Was there a difference in life expectancy across different continents in 2016?

In [9]:
# Inspect the data
gapminder_2016
A tibble: 185 × 4
country    year  continent  life_expectancy
<chr>     <dbl>  <chr>                <dbl>
Albania    2016  Europe                78.1
Algeria    2016  Africa                76.5
Angola     2016  Africa                60.0
⋮
Yemen      2016  Asia                 64.92
Zambia     2016  Africa               57.10
Zimbabwe   2016  Africa               61.69
In [10]:
# Was there a difference in life expectancy across different continents in 2016?

First let's create a histogram of all countries' life expectancies.

ggplot(gapminder_2016, aes(x = life_expectancy)) +
    geom_histogram()

Then color by continent. Note that we need to use fill since color is just for the outline of each rectangle.

ggplot(gapminder_2016, aes(x = life_expectancy, fill = continent)) +
    geom_histogram()

The position of the bars defaults to "stack", meaning that they are stacked on top of each other, which is not very easy to read. Setting the position to "identity" will overlay the histograms on top of each other so that they all have the same baseline. When doing this we also need to add transparency.

ggplot(gapminder_2016, aes(x = life_expectancy, fill = continent)) +
    geom_histogram(position = "identity", alpha = 0.6)

This is still not an effective way to convey this information, so let’s try a different strategy: creating multiple separate histograms using faceting.

ggplot(gapminder_2016, aes(x = life_expectancy, fill = continent)) +
    geom_histogram() +
    facet_grid(rows = vars(continent))

That looks much better and is easy to compare! (Mention that axis labels should be changed to be human readable and optionally show it as well as how to make the plot taller)

options(repr.plot.width = 10, repr.plot.height = 12)
...
    labs(x = "Life Expectancy (years)", y = "Count", fill = "Continent")
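
Since data/gapminder.csv may not be at hand, here is a sketch of the same faceting technique on the diamonds data that ships with ggplot2. The dataset swap, the variable choice (price by cut), and the name facet_plot are assumptions made for this example. The bins = 30 argument spells out the default that geom_histogram otherwise picks with a message.

```r
library(ggplot2)

# One histogram panel per category, stacked in rows so they share an x axis
facet_plot <- ggplot(diamonds, aes(x = price, fill = cut)) +
    geom_histogram(bins = 30) +
    facet_grid(rows = vars(cut)) +
    labs(x = "Price (US dollars)", y = "Count", fill = "Cut")

facet_plot
```

Because the panels share the x axis, differences in the location and spread of each distribution can be read by scanning down the rows.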

A few rules of thumb for creating effective visualizations¶

Rule of Thumb: No tables / pie charts / 3D¶

Which one is easier to interpret? Pie graph - colours don't mean anything (unnecessary)

  • hard to see size of slices relative to the other slices

Rule of Thumb: No tables / pie charts / 3D¶

  • the third dimension does not improve the reading of the data
  • these plots are difficult to interpret because of the distorted effect of perspective associated with the third dimension.
  • 3D is discouraged for charts in general, and should only be used for very specific applications
  • the bars or slices in a pie graph that are closer to the reader appear to be larger than those in the back due to the angle at which they're presented

Rule of Thumb: Use simple, colourblind-friendly colour palettes¶

Rule of Thumb: Include labels and legends, make them legible¶

Remember: a great visualization tells its own story without needing you to be there explaining things.

In [11]:
options(repr.plot.width = 4, repr.plot.height = 4)
diamond_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point() +
    xlab("Size (carat)") +
    ylab("Price (US dollars)")
diamond_plot

Add alpha = 0.2 to geom_point
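
A sketch of that change applied to the diamonds plot from the cell above (diamond_plot_alpha is a name made up for this example):

```r
library(ggplot2)

# Same scatter plot, but each point is mostly transparent so that
# dense regions look darker than sparse ones
diamond_plot_alpha <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.2) +
    xlab("Size (carat)") +
    ylab("Price (US dollars)")

diamond_plot_alpha
```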

  • transparency setting between [0,1]
  • too many colours (overwhelming)

  • less is more

  • Make sure to use colour schemes that are understandable by those with colour blindness. For example, the RColorBrewer R library provides the ability to pick such colour schemes, and you can check your visualizations after you have created them by uploading them to online tools such as the colour blindness simulator.

  • Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience. For instance, you can also consider using shapes to represent different groups.
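
A rough sketch combining the last two points on the mtcars scatter plot from earlier: a ColorBrewer palette applied with scale_color_brewer(), plus shape as a redundant encoding of the same groups. The choice of "Dark2" is an assumption for this example (RColorBrewer's brewer.pal.info table flags it as colourblind-friendly), and redundant_plot is a made-up name.

```r
library(ggplot2)

# Map the same grouping variable to both colour and shape,
# so the groups stay distinguishable even without colour
redundant_plot <- ggplot(mtcars,
                         aes(x = hp, y = mpg,
                             color = factor(cyl), shape = factor(cyl))) +
    geom_point() +
    scale_color_brewer(palette = "Dark2") +
    labs(x = "Horsepower", y = "Miles per Gallon",
         color = "Cylinders", shape = "Cylinders")

redundant_plot
```

Giving the colour and shape legends the same title lets ggplot2 merge them into a single legend.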

Go and create!¶

What did we learn today?¶

TidyTuesday!¶

Weekly practice exploring and visualizing data: https://github.com/rfordatascience/tidytuesday